BALANCING THE ASSUMPTIONS OF CAUSAL INFERENCE AND NATURAL LANGUAGE PROCESSING

Drawing conclusions about real-world relationships of cause and effect from data collected without randomization requires making assumptions about the true processes that generate the data we observe. Causal inference typically considers low-dimensional data such as categorical or numerical fields in structured medical records. Yet a restriction to such data excludes natural language texts -- including social media posts or clinical free-text notes -- that can provide a powerful perspective into many aspects of our lives. This thesis explores whether the simplifying assumptions we make to model human language and behavior can support the causal conclusions necessary to inform decisions in healthcare or public policy. An analysis of millions of documents must rely on automated methods from machine learning and natural language processing, yet trust is essential in many clinical or policy applications. We need to develop causal methods that can reflect the uncertainty of imperfect predictive models to inform robust decision-making. We explore several areas of research in pursuit of these goals. We propose a measurement error approach for incorporating text classifiers into causal analyses and demonstrate the assumption on which it relies. We introduce a framework for generating synthetic text datasets on which causal inference methods can be evaluated, and use it to demonstrate that many existing approaches make assumptions that are likely violated. We then propose a proxy model methodology that provides explanations for uninterpretable black-box models, and close by incorporating it into our measurement error approach to explore the assumptions necessary for an analysis of gender and toxicity on Twitter.
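To make the measurement error idea concrete, the sketch below is an illustration, not code from the thesis: it assumes the outcome is a binary label predicted by a text classifier whose sensitivity and specificity are known (say, from a held-out validation set) and whose errors are nondifferential with respect to treatment. Under those assumptions, a Rogan-Gladen-style correction recovers the true outcome rates, and hence the risk difference, from the classifier's noisy predictions. All names and numbers here are hypothetical.

```python
# Illustrative sketch (NOT the thesis's code) of a measurement-error
# correction for a classifier-predicted binary outcome in a two-group
# causal comparison, assuming known, nondifferential error rates.
import numpy as np

def corrected_prevalence(p_hat, sens, spec):
    """Rogan-Gladen correction: recover the true positive rate from the
    classifier-predicted rate p_hat, given sensitivity and specificity."""
    return (p_hat + spec - 1.0) / (sens + spec - 1.0)

def corrected_risk_difference(p_hat_treated, p_hat_control, sens, spec):
    """Difference of corrected rates; the (spec - 1) offsets cancel, so
    nondifferential misclassification only rescales the naive difference."""
    return (p_hat_treated - p_hat_control) / (sens + spec - 1.0)

rng = np.random.default_rng(0)
n = 200_000
treated = rng.random(n) < 0.5

# True (unobserved) binary outcome: rates 0.30 vs 0.20, so the true
# risk difference is 0.10.
true_outcome = rng.random(n) < np.where(treated, 0.30, 0.20)

# Observed labels from an imperfect classifier: 90% sensitivity,
# 85% specificity, errors independent of treatment (nondifferential).
sens, spec = 0.90, 0.85
predicted = np.where(true_outcome,
                     rng.random(n) < sens,        # true positives detected
                     rng.random(n) < 1.0 - spec)  # false positives on negatives

p_t, p_c = predicted[treated].mean(), predicted[~treated].mean()
naive = p_t - p_c  # attenuated toward zero by classifier error
corrected = corrected_risk_difference(p_t, p_c, sens, spec)
print(f"naive: {naive:.3f}  corrected: {corrected:.3f}  truth: 0.100")
```

In this simulation the naive estimate shrinks toward roughly 0.10 x (sens + spec - 1) = 0.075, while the corrected estimate recovers the true 0.10. When sensitivity and specificity are themselves estimated, their uncertainty should also be propagated (for example, by bootstrapping the validation set), which is part of what a fuller measurement-error treatment addresses.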
